SML HW4 : Formula 1 World Championship Analysis Using Unsupervised Learning¶

Team Members: Alekhiya, Devi and Hrishabh¶

Question 1: Can we group F1 drivers by their race-day performance profiles across seasons using data from results, qualifying, lap times and pit stops?¶

In [35]:
%pip install yellowbrick
%pip install plotly
%pip install tabulate
Requirement already satisfied: yellowbrick, plotly, tabulate (and dependencies).
Note: you may need to restart the kernel to use updated packages.
In [87]:
import re
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from sklearn.metrics import silhouette_score
from collections import Counter
import plotly.graph_objects as go
from matplotlib.colors import LinearSegmentedColormap
from matplotlib import colors as mcolors
from tabulate import tabulate
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from fancyimpute import SoftImpute

Load Data¶

In [90]:
dataframes = {
    name: pd.read_csv(f"{name}.csv") for name in [
        "circuits", "constructor_results", "constructor_standings", "constructors",
        "driver_standings", "drivers", "lap_times", "pit_stops",
        "qualifying", "races", "results", "seasons",
        "sprint_results", "status"
    ]
}
In [92]:
# Print the shape and first rows of each dataset
for name, df in dataframes.items():
    print(f"\n==== {name.upper()} ====")
    print(f"Shape: {df.shape}")
    print(df.head())
==== CIRCUITS ====
Shape: (77, 9)
   circuitId   circuitRef                            name      location  \
0          1  albert_park  Albert Park Grand Prix Circuit     Melbourne   
1          2       sepang    Sepang International Circuit  Kuala Lumpur   
2          3      bahrain   Bahrain International Circuit        Sakhir   
3          4    catalunya  Circuit de Barcelona-Catalunya      Montmeló   
4          5     istanbul                   Istanbul Park      Istanbul   

     country       lat        lng  alt  \
0  Australia -37.84970  144.96800   10   
1   Malaysia   2.76083  101.73800   18   
2    Bahrain  26.03250   50.51060    7   
3      Spain  41.57000    2.26111  109   
4     Turkey  40.95170   29.40500  130   

                                                 url  
0  http://en.wikipedia.org/wiki/Melbourne_Grand_P...  
1  http://en.wikipedia.org/wiki/Sepang_Internatio...  
2  http://en.wikipedia.org/wiki/Bahrain_Internati...  
3  http://en.wikipedia.org/wiki/Circuit_de_Barcel...  
4         http://en.wikipedia.org/wiki/Istanbul_Park  

==== CONSTRUCTOR_RESULTS ====
Shape: (12625, 5)
   constructorResultsId  raceId  constructorId  points status
0                     1      18              1    14.0     \N
1                     2      18              2     8.0     \N
2                     3      18              3     9.0     \N
3                     4      18              4     5.0     \N
4                     5      18              5     2.0     \N

==== CONSTRUCTOR_STANDINGS ====
Shape: (13391, 7)
   constructorStandingsId  raceId  constructorId  points  position  \
0                       1      18              1    14.0         1   
1                       2      18              2     8.0         3   
2                       3      18              3     9.0         2   
3                       4      18              4     5.0         4   
4                       5      18              5     2.0         5   

  positionText  wins  
0            1     1  
1            3     0  
2            2     0  
3            4     0  
4            5     0  

==== CONSTRUCTORS ====
Shape: (212, 5)
   constructorId constructorRef        name nationality  \
0              1        mclaren     McLaren     British   
1              2     bmw_sauber  BMW Sauber      German   
2              3       williams    Williams     British   
3              4        renault     Renault      French   
4              5     toro_rosso  Toro Rosso     Italian   

                                                 url  
0               http://en.wikipedia.org/wiki/McLaren  
1            http://en.wikipedia.org/wiki/BMW_Sauber  
2  http://en.wikipedia.org/wiki/Williams_Grand_Pr...  
3  http://en.wikipedia.org/wiki/Renault_in_Formul...  
4   http://en.wikipedia.org/wiki/Scuderia_Toro_Rosso  

==== DRIVER_STANDINGS ====
Shape: (34863, 7)
   driverStandingsId  raceId  driverId  points  position positionText  wins
0                  1      18         1    10.0         1            1     1
1                  2      18         2     8.0         2            2     0
2                  3      18         3     6.0         3            3     0
3                  4      18         4     5.0         4            4     0
4                  5      18         5     4.0         5            5     0

==== DRIVERS ====
Shape: (861, 9)
   driverId   driverRef number code  forename     surname         dob  \
0         1    hamilton     44  HAM     Lewis    Hamilton  1985-01-07   
1         2    heidfeld     \N  HEI      Nick    Heidfeld  1977-05-10   
2         3     rosberg      6  ROS      Nico     Rosberg  1985-06-27   
3         4      alonso     14  ALO  Fernando      Alonso  1981-07-29   
4         5  kovalainen     \N  KOV    Heikki  Kovalainen  1981-10-19   

  nationality                                             url  
0     British     http://en.wikipedia.org/wiki/Lewis_Hamilton  
1      German      http://en.wikipedia.org/wiki/Nick_Heidfeld  
2      German       http://en.wikipedia.org/wiki/Nico_Rosberg  
3     Spanish    http://en.wikipedia.org/wiki/Fernando_Alonso  
4     Finnish  http://en.wikipedia.org/wiki/Heikki_Kovalainen  

==== LAP_TIMES ====
Shape: (589081, 6)
   raceId  driverId  lap  position      time  milliseconds
0     841        20    1         1  1:38.109         98109
1     841        20    2         1  1:33.006         93006
2     841        20    3         1  1:32.713         92713
3     841        20    4         1  1:32.803         92803
4     841        20    5         1  1:32.342         92342

==== PIT_STOPS ====
Shape: (11371, 7)
   raceId  driverId  stop  lap      time duration  milliseconds
0     841       153     1    1  17:05:23   26.898         26898
1     841        30     1    1  17:05:52   25.021         25021
2     841        17     1   11  17:20:48   23.426         23426
3     841         4     1   12  17:22:34   23.251         23251
4     841        13     1   13  17:24:10   23.842         23842

==== QUALIFYING ====
Shape: (10494, 9)
   qualifyId  raceId  driverId  constructorId  number  position        q1  \
0          1      18         1              1      22         1  1:26.572   
1          2      18         9              2       4         2  1:26.103   
2          3      18         5              1      23         3  1:25.664   
3          4      18        13              6       2         4  1:25.994   
4          5      18         2              2       3         5  1:25.960   

         q2        q3  
0  1:25.187  1:26.714  
1  1:25.315  1:26.869  
2  1:25.452  1:27.079  
3  1:25.691  1:27.178  
4  1:25.518  1:27.236  

==== RACES ====
Shape: (1125, 18)
   raceId  year  round  circuitId                   name        date  \
0       1  2009      1          1  Australian Grand Prix  2009-03-29   
1       2  2009      2          2   Malaysian Grand Prix  2009-04-05   
2       3  2009      3         17     Chinese Grand Prix  2009-04-19   
3       4  2009      4          3     Bahrain Grand Prix  2009-04-26   
4       5  2009      5          4     Spanish Grand Prix  2009-05-10   

       time                                                url fp1_date  \
0  06:00:00  http://en.wikipedia.org/wiki/2009_Australian_G...       \N   
1  09:00:00  http://en.wikipedia.org/wiki/2009_Malaysian_Gr...       \N   
2  07:00:00  http://en.wikipedia.org/wiki/2009_Chinese_Gran...       \N   
3  12:00:00  http://en.wikipedia.org/wiki/2009_Bahrain_Gran...       \N   
4  12:00:00  http://en.wikipedia.org/wiki/2009_Spanish_Gran...       \N   

  fp1_time fp2_date fp2_time fp3_date fp3_time quali_date quali_time  \
0       \N       \N       \N       \N       \N         \N         \N   
1       \N       \N       \N       \N       \N         \N         \N   
2       \N       \N       \N       \N       \N         \N         \N   
3       \N       \N       \N       \N       \N         \N         \N   
4       \N       \N       \N       \N       \N         \N         \N   

  sprint_date sprint_time  
0          \N          \N  
1          \N          \N  
2          \N          \N  
3          \N          \N  
4          \N          \N  

==== RESULTS ====
Shape: (26759, 18)
   resultId  raceId  driverId  constructorId number  grid position  \
0         1      18         1              1     22     1        1   
1         2      18         2              2      3     5        2   
2         3      18         3              3      7     7        3   
3         4      18         4              4      5    11        4   
4         5      18         5              1     23     3        5   

  positionText  positionOrder  points  laps         time milliseconds  \
0            1              1    10.0    58  1:34:50.616      5690616   
1            2              2     8.0    58       +5.478      5696094   
2            3              3     6.0    58       +8.163      5698779   
3            4              4     5.0    58      +17.181      5707797   
4            5              5     4.0    58      +18.014      5708630   

  fastestLap rank fastestLapTime fastestLapSpeed  statusId  
0         39    2       1:27.452         218.300         1  
1         41    3       1:27.739         217.586         1  
2         41    5       1:28.090         216.719         1  
3         58    7       1:28.603         215.464         1  
4         43    1       1:27.418         218.385         1  

==== SEASONS ====
Shape: (75, 2)
   year                                                url
0  2009  http://en.wikipedia.org/wiki/2009_Formula_One_...
1  2008  http://en.wikipedia.org/wiki/2008_Formula_One_...
2  2007  http://en.wikipedia.org/wiki/2007_Formula_One_...
3  2006  http://en.wikipedia.org/wiki/2006_Formula_One_...
4  2005  http://en.wikipedia.org/wiki/2005_Formula_One_...

==== SPRINT_RESULTS ====
Shape: (360, 16)
   resultId  raceId  driverId  constructorId  number  grid position  \
0         1    1061       830              9      33     2        1   
1         2    1061         1            131      44     1        2   
2         3    1061       822            131      77     3        3   
3         4    1061       844              6      16     4        4   
4         5    1061       846              1       4     6        5   

  positionText  positionOrder  points  laps       time milliseconds  \
0            1              1       3    17  25:38.426      1538426   
1            2              2       2    17     +1.430      1539856   
2            3              3       1    17     +7.502      1545928   
3            4              4       0    17    +11.278      1549704   
4            5              5       0    17    +24.111      1562537   

  fastestLap fastestLapTime  statusId  
0         14       1:30.013         1  
1         17       1:29.937         1  
2         17       1:29.958         1  
3         16       1:30.163         1  
4         16       1:30.566         1  

==== STATUS ====
Shape: (139, 2)
   statusId        status
0         1      Finished
1         2  Disqualified
2         3      Accident
3         4     Collision
4         5        Engine

To characterize "race-day performance," we merge the following tables:
results.csv
qualifying.csv
pit_stops.csv
lap_times.csv (aggregated to an average lap time per race)

In [95]:
pd.set_option('display.max_columns', None)
results = (
    dataframes["results"][[
        "raceId", "driverId", "constructorId", "grid", "positionOrder", "points",
        "laps", "time", "milliseconds", "fastestLap", "rank", "fastestLapTime",
        "fastestLapSpeed", "statusId"
    ]]
    .rename(columns={"rank": "fastesLapRank", "milliseconds": "Total_time_ms",
                     "positionOrder": "finalPosition"})
    .merge(dataframes["races"][["raceId", "year", "round", "circuitId", "name", "date"]],
           on="raceId", how="left")
)

#Merging with status table on statusId
results = results.merge(dataframes["status"], on="statusId", how="left")

# Merge with qualifying data
merged = results.merge(
    dataframes["qualifying"][["raceId", "driverId", "position", "q1", "q2", "q3"]]
        .rename(columns={"position": "qualifyingPosition"}),
    on=["raceId", "driverId"], how="left"
)

merged = merged.merge(dataframes["drivers"][["driverId", "driverRef"]], on="driverId", how="left")

#Merge with pit_stops data

pit_summary = dataframes["pit_stops"].copy()
pit_summary["milliseconds"] = pd.to_numeric(pit_summary["milliseconds"], errors="coerce")
pit_summary = pit_summary.groupby(["raceId", "driverId"]).agg(
    total_pit_stops=('stop', 'count'),
    average_pit_duration_ms=('milliseconds', 'mean')
).reset_index()
merged = merged.merge(pit_summary, on=["raceId", "driverId"], how="left")

#Merge with lap_times data
avg_lap_time = dataframes["lap_times"].groupby(["raceId", "driverId"])["milliseconds"].mean().reset_index(name="avg_lap_time")
avg_lap_time["avg_lap_time_str"] = avg_lap_time["avg_lap_time"].apply(
    lambda ms: f"{int(ms // 60000)}:{(ms % 60000) / 1000:.3f}" if pd.notna(ms) else None
)

merged = merged.merge(avg_lap_time.rename(columns = {"avg_lap_time": "avg_lap_time_ms"}), on=["raceId", "driverId"], how="left")
merged_matrix_comp = merged.copy()
In [97]:
merged.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26759 entries, 0 to 26758
Data columns (total 29 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   raceId                   26759 non-null  int64  
 1   driverId                 26759 non-null  int64  
 2   constructorId            26759 non-null  int64  
 3   grid                     26759 non-null  int64  
 4   finalPosition            26759 non-null  int64  
 5   points                   26759 non-null  float64
 6   laps                     26759 non-null  int64  
 7   time                     26759 non-null  object 
 8   Total_time_ms            26759 non-null  object 
 9   fastestLap               26759 non-null  object 
 10  fastesLapRank            26759 non-null  object 
 11  fastestLapTime           26759 non-null  object 
 12  fastestLapSpeed          26759 non-null  object 
 13  statusId                 26759 non-null  int64  
 14  year                     26759 non-null  int64  
 15  round                    26759 non-null  int64  
 16  circuitId                26759 non-null  int64  
 17  name                     26759 non-null  object 
 18  date                     26759 non-null  object 
 19  status                   26759 non-null  object 
 20  qualifyingPosition       10494 non-null  float64
 21  q1                       10494 non-null  object 
 22  q2                       10472 non-null  object 
 23  q3                       10448 non-null  object 
 24  driverRef                26759 non-null  object 
 25  total_pit_stops          5575 non-null   float64
 26  average_pit_duration_ms  5575 non-null   float64
 27  avg_lap_time_ms          11041 non-null  float64
 28  avg_lap_time_str         11041 non-null  object 
dtypes: float64(5), int64(10), object(14)
memory usage: 5.9+ MB
In [99]:
merged.isnull().sum()
Out[99]:
raceId                         0
driverId                       0
constructorId                  0
grid                           0
finalPosition                  0
points                         0
laps                           0
time                           0
Total_time_ms                  0
fastestLap                     0
fastesLapRank                  0
fastestLapTime                 0
fastestLapSpeed                0
statusId                       0
year                           0
round                          0
circuitId                      0
name                           0
date                           0
status                         0
qualifyingPosition         16265
q1                         16265
q2                         16287
q3                         16311
driverRef                      0
total_pit_stops            21184
average_pit_duration_ms    21184
avg_lap_time_ms            15718
avg_lap_time_str           15718
dtype: int64

Data Cleaning and Transformation¶

The columns (q1, q2, q3, total_pit_stops, average_pit_duration_ms, avg_lap_time_str, avg_lap_time_ms) each contain roughly 50% or more missing data. These columns are essential for clustering drivers by performance, and imputing such a large share of missing values could introduce significant bias or noise into the analysis, so we drop the incomplete rows instead.
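The per-column missing percentages quoted above can be computed directly with `isnull().mean()`. A minimal sketch on a hypothetical toy frame (standing in for `merged`, not the actual data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `merged`: one sparse column, one complete column
toy = pd.DataFrame({
    "qualifyingPosition": [1.0, np.nan, 3.0, np.nan],
    "points": [25.0, 18.0, 15.0, 12.0],
})

# Fraction of missing values per column, expressed as a percentage
missing_pct = toy.isnull().mean() * 100
print(missing_pct)  # qualifyingPosition 50.0, points 0.0
```

Applying the same expression to `merged` reproduces the counts shown by `merged.isnull().sum()` above, scaled by the row count.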

In [103]:
# Drop rows missing any of the key performance columns
merged = merged.dropna(subset=['qualifyingPosition', 'q1', 'q2', 'q3', 'total_pit_stops', 'average_pit_duration_ms', 'avg_lap_time_ms', 'avg_lap_time_str']).reset_index(drop=True)

# Some columns contain the placeholder \N; convert it to NaN and drop those rows
merged.replace(to_replace='\\N', value=np.nan, inplace=True)
merged.dropna(inplace=True)

# New feature: position change from starting grid position to final finishing position
merged["position_change"] = merged["grid"] - merged["finalPosition"].astype(int)
In [105]:
# Confirm that no missing values remain
merged.isnull().sum().sum()
Out[105]:
0
In [107]:
# Identify remaining object-dtype columns; time-like ones will be converted to milliseconds
merged.select_dtypes(include="object").columns
Out[107]:
Index(['time', 'Total_time_ms', 'fastestLap', 'fastesLapRank',
       'fastestLapTime', 'fastestLapSpeed', 'name', 'date', 'status', 'q1',
       'q2', 'q3', 'driverRef', 'avg_lap_time_str'],
      dtype='object')
In [109]:
# Convert a time string such as '1:29.844' to milliseconds
def time_str_to_ms(t):
    if pd.isna(t):
        return np.nan
    t = str(t).strip()
    if re.match(r"^\d+:\d+\.\d+$", t):  # matches the 'M:SS.mmm' style
        minutes, rest = t.split(":")
        return int(minutes) * 60000 + float(rest) * 1000
    return np.nan  # unparsable strings (e.g. '\N') become NaN

# Apply to the object-dtype time columns, storing milliseconds as nullable Int64
merged["fastestLapTime_ms"] = merged["fastestLapTime"].apply(time_str_to_ms).astype("Int64")
merged["q1_ms"] = merged["q1"].apply(time_str_to_ms).astype("Int64")
merged["q2_ms"] = merged["q2"].apply(time_str_to_ms).astype("Int64")
merged["q3_ms"] = merged["q3"].apply(time_str_to_ms).astype("Int64")
merged["Total_time_ms"] = pd.to_numeric(merged["Total_time_ms"], errors="coerce").astype("Int64")
merged["fastestLap"] = pd.to_numeric(merged["fastestLap"], errors="coerce").astype("Int64")
merged["fastesLapRank"] = pd.to_numeric(merged["fastesLapRank"], errors="coerce").astype("Int64")
merged["fastestLapSpeed"] = pd.to_numeric(merged["fastestLapSpeed"], errors="coerce").astype("float64")
In [111]:
# Encode the qualitative variable status as integers
le = LabelEncoder()
merged["status_encoded"] = le.fit_transform(merged["status"])
print(dict(zip(le.classes_, le.transform(le.classes_))))
{'Disqualified': 0, 'Finished': 1, 'Power Unit': 2}
In [113]:
# Drop columns that do not contain numeric data
final_data = merged.drop(columns = ['time', 'fastestLapTime', 'q1', 'q2', 'q3', 'driverRef', 'avg_lap_time_str', 'name', 'status', 'date'])

Columns like raceId, driverId, and constructorId are stored as numbers, but they are categorical identifiers in numeric form: their values carry no mathematical meaning (driverId = 10 is not "twice" driverId = 5). We therefore remove those columns and perform SVD on the remaining features.

In [116]:
# Exclude identifier and bookkeeping columns (raceId, driverId, constructorId, etc.) from standardization
X_features = final_data.drop(columns=["raceId", "driverId", "constructorId", "circuitId", "statusId", "round", "status_encoded", "year", "laps"])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_features)
data_scaled_df = pd.DataFrame(X_scaled, columns=X_features.columns)
data_scaled_df_copy = data_scaled_df.copy()
In [118]:
#Pre scaled data
final_data.head()
Out[118]:
raceId driverId constructorId grid finalPosition points laps Total_time_ms fastestLap fastesLapRank fastestLapSpeed statusId year round circuitId qualifyingPosition total_pit_stops average_pit_duration_ms avg_lap_time_ms position_change fastestLapTime_ms q1_ms q2_ms q3_ms status_encoded
0 841 20 9 1 1 25.0 58 5370259 44 4 212.488 1 2011 1 1 1.0 2.0 23319.500000 92590.672414 0 89844 85296 84090 83529 1
1 841 1 1 2 2 18.0 58 5392556 41 8 211.382 1 2011 1 1 2.0 2.0 23213.000000 92975.103448 0 90314 85384 84595 84307 1
2 841 808 4 6 3 15.0 58 5400819 55 7 211.969 1 2011 1 1 6.0 2.0 25109.000000 93117.568966 3 90064 85543 85582 85247 1
3 841 4 6 5 4 12.0 58 5402031 49 2 213.336 1 2011 1 1 5.0 3.0 24055.000000 93138.465517 1 89487 85707 85242 84974 1
4 841 17 9 3 5 10.0 58 5408430 50 3 213.066 1 2011 1 1 3.0 3.0 24058.666667 93248.793103 -2 89600 85900 84658 84395 1
In [120]:
#Scaled data
data_scaled_df.head()
Out[120]:
grid finalPosition points Total_time_ms fastestLap fastesLapRank fastestLapSpeed qualifyingPosition total_pit_stops average_pit_duration_ms avg_lap_time_ms position_change fastestLapTime_ms q1_ms q2_ms q3_ms
0 -1.290033 -1.235402 1.737068 -0.439229 -0.391093 -0.513603 0.236956 -1.414751 -0.027390 -0.269378 -0.389816 -0.048314 -0.065757 -0.250179 -0.293561 -0.325067
1 -0.964747 -0.912358 0.794654 -0.420336 -0.635854 0.465612 0.184532 -1.046952 -0.027390 -0.269930 -0.369424 -0.048314 -0.025557 -0.243077 -0.252019 -0.262457
2 0.336397 -0.589315 0.390762 -0.413335 0.506365 0.220808 0.212356 0.424243 -0.027390 -0.260102 -0.361867 0.975400 -0.046940 -0.230246 -0.170826 -0.186809
3 0.011111 -0.266271 -0.013130 -0.412308 0.016842 -1.003211 0.277152 0.056444 1.016529 -0.265566 -0.360759 0.292924 -0.096292 -0.217011 -0.198795 -0.208779
4 -0.639461 0.056772 -0.282391 -0.406886 0.098430 -0.758407 0.264354 -0.679154 1.016529 -0.265547 -0.354907 -0.730789 -0.086627 -0.201436 -0.246836 -0.255375

MATRIX COMPLETION¶

In [123]:
merged_matrix_comp["fastestLapTime_ms"] = merged_matrix_comp["fastestLapTime"].apply(time_str_to_ms).astype("Int64")
merged_matrix_comp["q1_ms"] = merged_matrix_comp["q1"].apply(time_str_to_ms).astype("Int64")
merged_matrix_comp["q2_ms"] = merged_matrix_comp["q2"].apply(time_str_to_ms).astype("Int64")
merged_matrix_comp["q3_ms"] = merged_matrix_comp["q3"].apply(time_str_to_ms).astype("Int64")
merged_matrix_comp["Total_time_ms"] = pd.to_numeric(merged_matrix_comp["Total_time_ms"], errors="coerce").astype("Int64")
merged_matrix_comp["fastestLap"] = pd.to_numeric(merged_matrix_comp["fastestLap"], errors="coerce").astype("Int64")
merged_matrix_comp["fastesLapRank"] = pd.to_numeric(merged_matrix_comp["fastesLapRank"], errors="coerce").astype("Int64")
merged_matrix_comp["fastestLapSpeed"] = pd.to_numeric(merged_matrix_comp["fastestLapSpeed"], errors="coerce").astype("float64")

merged_matrix_comp = merged_matrix_comp.drop(columns=["fastestLapTime", "q1", "q2", "q3", "avg_lap_time_str", "time", "name", "date", "status", "driverRef" ])

df_all = merged_matrix_comp.copy().astype("float")
scaler = StandardScaler()
new_df = scaler.fit_transform(df_all)

# Impute the missing entries with SoftImpute (iterative soft-thresholded SVD)
soft_imputer = SoftImpute()
X_completed = soft_imputer.fit_transform(new_df)

# Convert the completed matrix back to a DataFrame
completed_df = pd.DataFrame(X_completed, columns=df_all.columns, index=df_all.index)
[SoftImpute] Max Singular Value of X_init = 266.705073
[SoftImpute] Iter 1: observed MAE=0.027991 rank=23
...
[SoftImpute] Iter 100: observed MAE=0.027671 rank=23
[SoftImpute] Stopped after iteration 100 for lambda=5.334101
In [125]:
print("Missing values before:", np.isnan(new_df).sum())
print("Missing values after: ", np.isnan(completed_df).sum().sum())
print("Missing %:", np.isnan(new_df).sum() / new_df.size * 100)
Missing values before: 227663
Missing values after:  0
Missing %: 36.990886446981676

Before imputation, 227,663 entries (~37% of the entire feature matrix) were missing. After SoftImpute, every missing value has been filled, leaving a complete matrix with 0% missing data. The rank stayed fixed at 23 across all iterations, so the algorithm settled on a low-rank approximation of the data matrix with 23 latent factors.
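The convergence log above follows the standard SoftImpute recipe: alternately reconstruct the matrix with a soft-thresholded SVD and refill only the missing entries, keeping observed entries fixed. A minimal NumPy sketch of the idea (a toy lambda and a rank-1 example, not the library's implementation):

```python
import numpy as np

def soft_impute(X, lam=1.0, max_iter=100, tol=1e-5):
    """Minimal SoftImpute sketch: iteratively fill missing entries with a
    soft-thresholded SVD reconstruction until the fill values stabilize."""
    mask = np.isnan(X)
    filled = np.where(mask, 0.0, X)              # initialize missing entries at 0
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s_thresh = np.maximum(s - lam, 0.0)      # soft-threshold singular values
        low_rank = (U * s_thresh) @ Vt
        new_filled = np.where(mask, low_rank, X) # keep observed entries fixed
        if np.linalg.norm(new_filled - filled) < tol * (np.linalg.norm(filled) + 1e-12):
            filled = new_filled
            break
        filled = new_filled
    return filled

# Toy demo: a rank-1 matrix with one entry removed is completed with no NaNs.
rng = np.random.default_rng(0)
u, v = rng.normal(size=(20, 1)), rng.normal(size=(1, 8))
M = u @ v
M_missing = M.copy()
M_missing[3, 4] = np.nan
M_hat = soft_impute(M_missing, lam=0.1)
print(np.isnan(M_hat).sum())  # 0 -- completed matrix has no missing values
```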

SVD¶

In [61]:
np.random.seed(42)

U, S, VT = np.linalg.svd(data_scaled_df, full_matrices=False)

explained_variance = (S ** 2) / np.sum(S ** 2) # Explained variance
svd_projection = U[:, :2] @ np.diag(S[:2]) # 2D SVD Projection

#sns.set(rc={'axes.facecolor': '#fcf0dc'}, style='darkgrid')
cumulative_explained_variance = np.cumsum(explained_variance)

# Plot the cumulative explained variance against the number of components
plt.figure(figsize=(20, 10))

# Bar chart for the explained variance of each component
barplot = sns.barplot(x=list(range(1, len(cumulative_explained_variance) + 1)),
                      y=explained_variance,
                      alpha=0.8)

# Line plot for the cumulative explained variance
lineplot, = plt.plot(range(0, len(cumulative_explained_variance)), cumulative_explained_variance,
                     marker='o', linestyle='--',  linewidth=2)

# Set labels and title
plt.xlabel('Number of Components', fontsize=14)
plt.ylabel('Explained Variance', fontsize=14)
plt.title('Cumulative Variance vs. Number of Components', fontsize=18)

# Customize ticks and legend
plt.xticks(range(0, len(cumulative_explained_variance)))
plt.legend(handles=[barplot.patches[0], lineplot],
           labels=['Explained Variance of Each Component', 'Cumulative Explained Variance'],
           loc=(0.62, 0.1),
           frameon=True,
           framealpha=1.0,  
           )  

# Display the variance values for both graphs on the plots
x_offset = -0.3
y_offset = 0.01
for i, (ev_ratio, cum_ev_ratio) in enumerate(zip(explained_variance, cumulative_explained_variance)):
    plt.text(i, ev_ratio, f"{ev_ratio:.2f}", ha="center", va="bottom", fontsize=10)
    if i > 0:
        plt.text(i + x_offset, cum_ev_ratio + y_offset, f"{cum_ev_ratio:.2f}", ha="center", va="bottom", fontsize=10)

plt.grid(axis='both')   
plt.show()
[Figure: explained variance of each component (bars) with cumulative explained variance overlaid (dashed line)]

The plot and the cumulative explained variance values indicate how much of the total variance in the dataset is captured by each component, as well as the cumulative variance explained by the first n components.

Here, we can observe that:

The first component explains approximately 31% of the variance.
The first two components together explain about 54% of the variance.
The first six components explain approximately 88% of the variance, and so on.

We choose 6 as the optimal number of components: beyond this point, adding another component does not significantly increase the cumulative variance. This is often referred to as the "elbow point" of the curve.

From the plot, the growth in cumulative variance clearly slows after the 6th component (which brings the cumulative total to about 88%), and each additional component explains only a small share of variance on its own.
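The cutoff can also be picked programmatically from the cumulative curve; a small sketch using illustrative variance ratios shaped like the plot above (~31% for the first PC, ~54% after two, ~88% after six — the exact values here are assumptions):

```python
import numpy as np

# Hypothetical per-component explained-variance ratios, shaped like the plot
explained_variance = np.array([0.31, 0.23, 0.12, 0.09, 0.07, 0.06, 0.05, 0.04, 0.03])
cum = np.cumsum(explained_variance)

# Smallest number of components whose cumulative variance reaches 85%
n_components = int(np.searchsorted(cum, 0.85)) + 1
print(n_components)  # 6
```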

PCA¶

In [65]:
# performing PCA with 6 components
pca = PCA(n_components=6)
F1_data_pca = pca.fit_transform(data_scaled_df)

# Creating a new dataframe from the PCA dataframe, with columns labeled PC1, PC2, etc.
F1_data_pca = pd.DataFrame(F1_data_pca, columns=['PC'+str(i+1) for i in range(pca.n_components_)])

id_index = final_data[["raceId", "driverId"]].reset_index(drop=True)
pca_df = pd.concat([id_index, F1_data_pca], axis=1).set_index(["raceId", "driverId"])
pca_df = pca_df.reset_index()
pca_df.head()
Out[65]:
raceId driverId PC1 PC2 PC3 PC4 PC5 PC6
0 841 20 -1.454654 -2.467678 -0.395891 -0.405817 0.197826 0.369316
1 841 1 -0.852697 -1.307286 -0.500208 -0.520739 0.199261 0.652279
2 841 808 -0.563475 0.022435 -0.643267 1.149230 -0.113728 -0.262184
3 841 4 -0.646255 -0.312125 -0.114757 0.235873 0.206241 -1.254385
4 841 17 -0.756625 -0.405742 0.096841 -1.191356 0.212870 -1.058389
In [66]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Reduce to 2 dimensions for visualization
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(data_scaled_df)
In [67]:
fig, axs = plt.subplots(1, 2, figsize=(14, 6))

# --- Plot 1: PCA projection (PC1 vs PC2)
axs[0].scatter(F1_data_pca['PC1'], F1_data_pca['PC2'], alpha=0.6, color='cornflowerblue')
axs[0].set_title("PCA Projection: PC1 vs PC2")
axs[0].set_xlabel("Principal Component 1")
axs[0].set_ylabel("Principal Component 2")
axs[0].grid(True)

# --- Plot 2: Original features (avg_lap_time_ms vs q3_ms)
axs[1].scatter(data_scaled_df_copy['avg_lap_time_ms'], data_scaled_df_copy['q3_ms'], alpha=0.6, color='darkorange')
axs[1].set_title("Original Feature Space")
axs[1].set_xlabel("Average Lap Time (ms) [standardized]")
axs[1].set_ylabel("Average q3 Time (ms) [standardized]")
axs[1].grid(True)

plt.tight_layout()
plt.show()
[Figure: left, PCA projection (PC1 vs PC2); right, standardized average lap time vs Q3 time]

There is no clear separation in the data when we visualize the first two PCA components. Next, we perform K-means on this data to visualize the clusters.

In [69]:
def highlight_top3(column):
    top3 = column.abs().nlargest(3).index
    return ['background-color:  #ffeacc' if i in top3 else '' for i in column.index]

# Create the PCA component DataFrame and apply the highlighting function
pc_df = pd.DataFrame(pca.components_.T, columns=['PC{}'.format(i+1) for i in range(pca.n_components_)],  
                     index=data_scaled_df.columns)

pc_df.style.apply(highlight_top3, axis=0)
Out[69]:
  PC1 PC2 PC3 PC4 PC5 PC6
grid 0.152734 0.378691 -0.078314 0.447781 0.003427 -0.123384
finalPosition 0.167506 0.453206 0.019497 -0.263958 0.028400 -0.081915
points -0.164398 -0.458953 -0.004199 0.197710 -0.021615 0.090173
Total_time_ms 0.052056 0.021915 0.597753 0.102366 -0.173509 0.187021
fastestLap -0.241297 0.078886 0.111464 0.032653 -0.282897 -0.381710
fastesLapRank 0.155376 0.357945 -0.063381 -0.091202 -0.045254 0.370206
fastestLapSpeed -0.033061 -0.016724 -0.233901 -0.012725 0.792781 -0.209529
qualifyingPosition 0.160590 0.416327 -0.053051 0.290780 -0.010428 -0.067910
total_pit_stops 0.004263 0.017461 0.376780 -0.077143 0.031884 -0.715291
average_pit_duration_ms 0.027882 0.019915 0.466359 0.109448 0.462281 0.228829
avg_lap_time_ms 0.311299 -0.094740 0.409112 0.076097 0.168784 0.060818
position_change -0.016716 -0.081469 -0.102749 0.748564 -0.026403 -0.042905
fastestLapTime_ms 0.424442 -0.162026 -0.046598 -0.014508 -0.069647 0.007642
q1_ms 0.419856 -0.177818 -0.080431 -0.030560 -0.059503 -0.097045
q2_ms 0.422749 -0.175166 -0.086161 -0.028154 -0.056222 -0.105239
q3_ms 0.420119 -0.165089 -0.093527 -0.012085 -0.049619 -0.116594
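The columns of the table above are the component loading vectors (`pca.components_.T`): each PC score is the dot product of a centered feature row with the corresponding column. A quick sketch on random data verifying this relationship (the shapes and seed here are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

# Each PC score equals the centered feature row dotted with that
# component's loading vector -- i.e., the columns of the loadings table.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 6))

pca_demo = PCA(n_components=3).fit(X)
scores = pca_demo.transform(X)
manual = (X - pca_demo.mean_) @ pca_demo.components_.T

print(np.allclose(scores, manual))  # True
```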
In [139]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
scaling_factor = 9
# Intersect with valid raceIds
selected_drivers = ['hamilton', 'vettel', 'max_verstappen', 'alonso', 'ricciardo']
#selected_races = [841, 842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 861, 862, 863, 864, 865, 866, 867, 868, 869]  # Example raceIds
#selected_races = list(range(1000, 1010))
selected_races = list(range(880, 900))
# Filter the merged dataframe
subset = merged[(merged['driverRef'].isin(selected_drivers)) & (merged['raceId'].isin(selected_races))].copy()

# Get the aligned scaled data
subset_scaled = data_scaled_df_copy.loc[subset.index]
subset_pca = pd.DataFrame(pca_2d.transform(subset_scaled), columns=['PC1', 'PC2'], index=subset.index)
from matplotlib import colormaps
colors = colormaps["tab10"].resampled(10)  # non-deprecated replacement for plt.cm.get_cmap
# Plot setup
fig, ax = plt.subplots(figsize=(12, 8))

# Plot PCA scores for each driver
for i, driver in enumerate(selected_drivers):
    idx = subset[subset['driverRef'] == driver].index
    ax.scatter(subset_pca.loc[idx, 'PC1'], subset_pca.loc[idx, 'PC2'],
               color=colors(i), label=driver, alpha=0.8, s=60)
line_length = 10         # Keeps the arrow/line length the same
label_offset = 10        # Increase this to spread out the text more

for i, (x_comp, y_comp) in enumerate(pca_2d.components_.T):
    # Line (arrow without head)
    ax.plot([0, x_comp * line_length], [0, y_comp * line_length], color='black', linewidth=1.5)

    # Text with increased spacing and rotation
    angle_rad = np.arctan2(y_comp, x_comp)
    angle_deg = np.degrees(angle_rad)

    # Flip upside-down labels to keep them upright
    if angle_deg > 90 or angle_deg < -90:
        angle_deg += 180
    ax.text(x_comp * label_offset, 
            y_comp * label_offset,
            data_scaled_df_copy.columns[i],
            fontsize=11, color='darkred',
            rotation=angle_deg, rotation_mode='anchor',
            ha='center', va='center')

# Annotate 30 random driver-race combos
#np.random.seed(42)
label_idx = np.random.choice(subset.index, size=30, replace=False)
for i in label_idx:
    label = subset.loc[i, 'driverRef'] + "_race" + str(subset.loc[i, 'raceId'])
    ax.text(subset_pca.loc[i, 'PC1'], subset_pca.loc[i, 'PC2'], label, fontsize=9)

# Aesthetics
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_title("PCA Projection with driver/race combo")
ax.axhline(0, color='gray', lw=0.5)
ax.axvline(0, color='gray', lw=0.5)
ax.grid(True)
ax.legend(title="Driver")
plt.tight_layout()
plt.show()
[Figure: PCA biplot of selected drivers across races 880-899, with feature loading directions]
In [141]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

scaling_factor = 9
line_length = 10
label_offset = 10

# Define your selected drivers and races
selected_drivers = ['vettel']
#selected_races = list(range(880, 900))
selected_races = [883, 888, 893, 898, 899]

# Filter the merged dataframe for selected drivers and races
subset_all = merged[(merged['driverRef'].isin(selected_drivers)) & (merged['raceId'].isin(selected_races))].copy()

# Filter only winning entries (final position == 1)
subset = subset_all[subset_all['finalPosition'] == 1].copy()

# Get the aligned scaled data
subset_scaled = data_scaled_df_copy.loc[subset.index]
subset_pca = pd.DataFrame(pca_2d.transform(subset_scaled), columns=['PC1', 'PC2'], index=subset.index)

from matplotlib import colormaps
colors = colormaps["tab10"].resampled(10)  # non-deprecated replacement for plt.cm.get_cmap

# Plot setup
fig, ax = plt.subplots(figsize=(12, 8))

# Plot PCA scores for each winner
for i, driver in enumerate(selected_drivers):
    idx = subset[subset['driverRef'] == driver].index
    ax.scatter(subset_pca.loc[idx, 'PC1'], subset_pca.loc[idx, 'PC2'],
               color=colors(i), label=driver, alpha=0.8, s=60)

# Plot PCA directions
for i, (x_comp, y_comp) in enumerate(pca_2d.components_.T):
    ax.plot([0, x_comp * line_length], [0, y_comp * line_length], color='black', linewidth=1.5)
    angle_rad = np.arctan2(y_comp, x_comp)
    angle_deg = np.degrees(angle_rad)
    if angle_deg > 90 or angle_deg < -90:
        angle_deg += 180
    ax.text(x_comp * label_offset,
            y_comp * label_offset,
            data_scaled_df_copy.columns[i],
            fontsize=11, color='darkred',
            rotation=angle_deg, rotation_mode='anchor',
            ha='center', va='center')

# Label the winning driver-race combos
for i in subset.index:
    label = subset.loc[i, 'driverRef'] + "_race" + str(subset.loc[i, 'raceId'])
    ax.text(subset_pca.loc[i, 'PC1'], subset_pca.loc[i, 'PC2'], label, fontsize=9)

# Aesthetics
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_title("PCA Projection (Only Race Winners Shown)")
ax.axhline(0, color='gray', lw=0.5)
ax.axvline(0, color='gray', lw=0.5)
ax.grid(True)
ax.legend(title="Driver")
plt.tight_layout()
plt.show()
[Figure: PCA biplot showing only race-winning entries for Vettel]

K Means¶

Determining the optimal number of clusters¶

In [163]:
ks = range(2, 15)
wss = []
silhouette_scores = []

for k in ks:
    kmeans = KMeans(n_clusters=k, n_init=20, random_state=42)
    labels = kmeans.fit_predict(F1_data_pca)
    wss.append(kmeans.inertia_)
    sil_score = silhouette_score(F1_data_pca, labels)
    silhouette_scores.append(sil_score)

# Plot
fig, ax1 = plt.subplots(figsize=(10, 5))

ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('Within-Cluster Sum of Squares')
ax1.plot(ks, wss, marker='o', label='WSS')
ax1.tick_params(axis='y')

ax2 = ax1.twinx()
ax2.set_ylabel('Silhouette Score')
ax2.plot(ks, silhouette_scores, marker='s', linestyle='--', label='Silhouette')
ax2.tick_params(axis='y')

plt.title('Clustering Performance vs Number of Clusters (k)')
fig.tight_layout()
plt.grid(True)
plt.show()
[Figure: within-cluster sum of squares (elbow) and silhouette score vs. number of clusters k]
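One way to turn the silhouette curve into a concrete choice is simply to take the k that maximizes the score; a sketch on synthetic, well-separated blobs (the `make_blobs` centers and parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic clusters
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=42,
)

ks = range(2, 8)
scores = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores.append(silhouette_score(X_demo, labels))

# Pick the k with the highest silhouette score
best_k = list(ks)[int(np.argmax(scores))]
print(best_k)  # 4
```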
In [164]:
kmeans = KMeans(n_clusters=4, init='k-means++', n_init=9, max_iter=100, random_state=0)
kmeans.fit(F1_data_pca)

# Optional: get cluster frequencies
cluster_frequencies = Counter(kmeans.labels_)

# Optional: relabel clusters in descending order of size
sorted_clusters = [c for c, _ in cluster_frequencies.most_common()]
relabel = {old: new for new, old in enumerate(sorted_clusters)}
new_labels = np.array([relabel[l] for l in kmeans.labels_])

# Add cluster labels
F1_data_pca["cluster"] = new_labels
data_scaled_df["cluster"] = new_labels
In [167]:
data_scaled_df.head()
Out[167]:
grid finalPosition points Total_time_ms fastestLap fastesLapRank fastestLapSpeed qualifyingPosition total_pit_stops average_pit_duration_ms avg_lap_time_ms position_change fastestLapTime_ms q1_ms q2_ms q3_ms cluster
0 -1.290033 -1.235402 1.737068 -0.439229 -0.391093 -0.513603 0.236956 -1.414751 -0.027390 -0.269378 -0.389816 -0.048314 -0.065757 -0.250179 -0.293561 -0.325067 0
1 -0.964747 -0.912358 0.794654 -0.420336 -0.635854 0.465612 0.184532 -1.046952 -0.027390 -0.269930 -0.369424 -0.048314 -0.025557 -0.243077 -0.252019 -0.262457 0
2 0.336397 -0.589315 0.390762 -0.413335 0.506365 0.220808 0.212356 0.424243 -0.027390 -0.260102 -0.361867 0.975400 -0.046940 -0.230246 -0.170826 -0.186809 0
3 0.011111 -0.266271 -0.013130 -0.412308 0.016842 -1.003211 0.277152 0.056444 1.016529 -0.265566 -0.360759 0.292924 -0.096292 -0.217011 -0.198795 -0.208779 0
4 -0.639461 0.056772 -0.282391 -0.406886 0.098430 -0.758407 0.264354 -0.679154 1.016529 -0.265547 -0.354907 -0.730789 -0.086627 -0.201436 -0.246836 -0.255375 0
In [169]:
colors = ['#e8000b', '#1ac938', '#023eff', '#000000']
cluster_0 = F1_data_pca[F1_data_pca['cluster'] == 0]
cluster_1 = F1_data_pca[F1_data_pca['cluster'] == 1]
cluster_2 = F1_data_pca[F1_data_pca['cluster'] == 2]
cluster_3 = F1_data_pca[F1_data_pca['cluster'] == 3]
# Create a 3D scatter plot
fig = go.Figure()

# Add data points for each cluster separately and specify the color
fig.add_trace(go.Scatter3d(x=cluster_0['PC1'], y=cluster_0['PC2'], z=cluster_0['PC3'], 
                           mode='markers', marker=dict(color=colors[0], size=5, opacity=0.4), name='Cluster 0'))
fig.add_trace(go.Scatter3d(x=cluster_1['PC1'], y=cluster_1['PC2'], z=cluster_1['PC3'], 
                           mode='markers', marker=dict(color=colors[1], size=5, opacity=0.4), name='Cluster 1'))
fig.add_trace(go.Scatter3d(x=cluster_2['PC1'], y=cluster_2['PC2'], z=cluster_2['PC3'], 
                           mode='markers', marker=dict(color=colors[2], size=5, opacity=0.4), name='Cluster 2'))
fig.add_trace(go.Scatter3d(x=cluster_3['PC1'], y=cluster_3['PC2'], z=cluster_3['PC3'], 
                           mode='markers', marker=dict(color=colors[3], size=5, opacity=0.4), name='Cluster 3'))

# Set the title and layout details
fig.update_layout(
    title=dict(text='3D Visualization of Clusters in PCA Space', x=0.5),
    scene=dict(
        xaxis=dict(backgroundcolor="#fcf0dc", gridcolor='white', title='PC1'),
        yaxis=dict(backgroundcolor="#fcf0dc", gridcolor='white', title='PC2'),
        zaxis=dict(backgroundcolor="#fcf0dc", gridcolor='white', title='PC3'),
    ),
    width=900,
    height=800
)

# Show the plot
fig.show()
In [ ]:
num_observations = len(F1_data_pca)

# Separate the features and the cluster labels
X = F1_data_pca.drop('cluster', axis=1)
clusters = F1_data_pca['cluster']

# Compute the metrics
sil_score = silhouette_score(X, clusters)

# Create a table to display the metrics and the number of observations
table_data = [
    ["Number of Observations", num_observations],
    ["Silhouette Score", sil_score]
]

# Print the table
print(tabulate(table_data, headers=["Metric", "Value"], tablefmt='pretty'))
In [ ]:
colors = ['#e8000b', '#1ac938', '#023eff', '#000000']

# The features were already standardized; copy the frame (cluster column included)
df_standardized = data_scaled_df.copy()

# Calculate the centroids of each cluster
cluster_centroids = df_standardized.groupby('cluster').mean()

# Function to create a radar chart
def create_radar_chart(ax, angles, data, color, cluster):
    # Plot the data and fill the area
    ax.fill(angles, data, color=color, alpha=0.4)
    ax.plot(angles, data, color=color, linewidth=2, linestyle='solid')
    
    # Add a title
    ax.set_title(f'Cluster {cluster}', size=20, color=color, y=1.1)

# Set data
labels=np.array(cluster_centroids.columns)
num_vars = len(labels)

# Compute angle of each axis
angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()

# The plot is circular, so we need to "complete the loop" and append the start to the end
labels = np.concatenate((labels, [labels[0]]))
angles += angles[:1]

# Initialize the figure
fig, ax = plt.subplots(figsize=(20, 10), subplot_kw=dict(polar=True), nrows=1, ncols=4)

# Create radar chart for each cluster
for i, color in enumerate(colors):
    data = cluster_centroids.loc[i].tolist()
    data += data[:1]  # Complete the loop
    create_radar_chart(ax[i], angles, data, color, i)

# Apply the same angular tick labels to every radar axis
for a in ax:
    a.set_xticks(angles[:-1])
    a.set_xticklabels(labels[:-1])

# Add a grid
ax[0].grid(color='grey', linewidth=0.5)

# Display the plot
plt.tight_layout()
plt.show()

Interpretation from radar chart
Cluster 0 (Red)
High on: fastestLap, points, positionOrder (final position)
Low on: position_change (negative) and qualifyingPosition
Interpretation: These drivers tend to finish at the top with fast lap times and strong race-day performance, even if they lose positions relative to their start. Possibly consistent, fast finishers.

Cluster 1 (Green)
High on: q1_ms, q2_ms, q3_ms (longer times), position_change

Low on: race outcomes like points, positionOrder, grid

Interpretation: Drivers in this group qualify poorly (slow qualifying laps) and often start at the back — but show strong racecraft, gaining positions during the race. Possibly skilled midfielders or rookies.

Cluster 2 (Blue)
High on: fastestLapSpeed, qualifyingPosition, positionOrder

Low on: points, pit_stops, avg_lap_time_ms

Interpretation: Drivers in this cluster are excellent qualifiers, with strong lap speeds, but may not convert those into high points. Possibly inconsistent performers or drivers in weaker cars.

Cluster 3 (Black)
High on: average_pit_duration_ms, total_pit_stops, Total_time_ms

Low on: everything else

Interpretation: Likely poor performers or drivers impacted by long pit stops, technical issues, or penalties. These could be backmarkers or those who don’t finish well.


Hierarchical Clustering¶

In [ ]:
linkage_methods = ['single', 'complete', 'average', 'centroid']
cluster_numbers  = [2, 3, 4, 5, 6]
In [ ]:
# Looping through each types of linkage
for method in linkage_methods:

    Z = linkage(X_scaled, method=method, metric='euclidean')

    # Ploting all dendrogram
    plt.figure(figsize=(15, 8))
    dendrogram(
        Z, leaf_rotation=90, leaf_font_size=6, color_threshold=None, no_labels=True
    )
    plt.title(f"{method.capitalize()} Linkage Dendrogram")
    plt.xlabel("Cluster ID")
    plt.ylabel("Distance")
    plt.tight_layout()
    plt.show()

    # Trying out different numbers of flat clusters
    print(f"\n {method.capitalize()} linkage:")
    for k in cluster_numbers:
        labels = fcluster(Z, t=k, criterion='maxclust')
        # bincount returns the counts for labels 1… k
        sizes = np.bincount(labels)[1:]
        print(f"k = {k:>2}  ->  cluster sizes: {sizes.tolist()}")

Key Observations:

--- Single linkage:

  • Extremely imbalanced clusters at every k
  • One massive cluster plus several very small clusters

--- Complete linkage:

  • Produces more balanced, mid-sized clusters at moderate values of k (4, 5, 6)
  • Still isolates some true outliers, but captures two or three main groupings among the bulk of the data

--- Average and Centroid linkage:

  • Both give outcomes similar to single linkage:
  • One dominating cluster plus other very small clusters / a few outliers

Scaling¶

In [ ]:
drop_cols = [
    "raceId","constructorId","circuitId", "statusId","round","year","laps","hc_cluster"
]
# Here, numeric features only considered, then dropping unwanted columns
numeric_cols = final_data.select_dtypes(include="number").columns
features = final_data[numeric_cols].drop(columns=drop_cols, errors="ignore")
In [ ]:
# Selected random sub-sample of 100 rows for dendrogram
n_samples = 100
sub_idx   = np.random.choice(features.index, size=n_samples, replace=False)
X_sub     = features.loc[sub_idx]

driver_ids = final_data.loc[sub_idx, 'driverId'].astype(str).values

# Now, lets scale sub-sampled data
scaler     = StandardScaler()
X_scaled   = scaler.fit_transform(X_sub)
scaled_df  = pd.DataFrame(X_scaled, columns=X_sub.columns, index=X_sub.index)
In [ ]:
# Linkage computation and dendrograms plots
linkage_methods = ['single','complete','average','centroid']
for method in linkage_methods:
    Z = linkage(X_scaled, method=method, metric='euclidean')

    plt.figure(figsize=(16, 6))
    dendrogram(
        Z, 
        labels = driver_ids,
        leaf_rotation=90,
        leaf_font_size=8,
    )
    plt.title(f"{method.capitalize()} Linkage (truncated) with driver IDs")
    plt.xlabel("Driver ID")
    plt.ylabel("Distance")
    plt.tight_layout()
    plt.show()

    # 4) Extract flat clusters on the full linkage of the sub-sample
    print(f"\n{method.capitalize()} linkage flat clusters (n={n_samples}):")
    for k in [2, 3, 4, 5, 6]:
        labels = fcluster(Z, t=k, criterion='maxclust')
        sizes  = np.bincount(labels)[1:]
        print(f"k={k:<2} -> sizes = {sizes.tolist()}")

We drew a random sub-sample of 100 rows, kept only the numeric features (dropping non-numeric and ID columns), and standardized them.

Effect of feature scaling on:

--- Single:

  • Normalizing the features slightly changes the linkage distances, but single linkage still forms one giant cluster containing almost all points, with each outlier split off.

--- Complete:

  • More balanced groupings, as scaling yields two or three mid-sized clusters instead of a single massive one.
  • Outliers continue to be isolated as tiny clusters, but they are now spread more evenly across clusters.
  • It seems to be the best linkage type for this dataset, as it divides the scaled data into clusters most effectively.

--- Average:

  • Still produces one dominant cluster holding most points, plus a smaller secondary cluster and a few singleton outliers.

--- Centroid:

  • Scaling makes the centroid calculations more stable, but the one big cluster and the outliers persist.

--- Difference observed after scaling data:

  • Scaling balances the influence of each feature, so no single variable with a large range dominates the clustering.
  • Complete linkage benefits most, producing noticeably more even cluster sizes once features are standardized.
  • Single, average, and centroid linkages remain largely unchanged in pattern.
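The scaling effect described above can be demonstrated directly on two features with very different ranges (the numbers below are illustrative, not taken from the F1 data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: lap times in milliseconds vs.
# pit-stop counts. Unscaled, the millisecond feature dominates distances.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(90_000, 2_000, size=200),         # avg lap time in ms (illustrative)
    rng.integers(0, 5, size=200).astype(float),  # total pit stops
])

ratio_raw = X[:, 0].std() / X[:, 1].std()

scaled = StandardScaler().fit_transform(X)
ratio_scaled = scaled[:, 0].std() / scaled[:, 1].std()

print(round(ratio_raw))        # on the order of 1000: lap time dominates
print(round(ratio_scaled, 2))  # 1.0: both features contribute comparably
```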